Similarity search is critical for many database applications, including theincreasingly popular online services for Content-Based Multimedia Retrieval(CBMR). These services, which include image search engines, must handle anoverwhelming volume of data, while keeping low response times. Thus,scalability is imperative for similarity search in Web-scale applications, butmost existing methods are sequential and target shared-memory machines. Here weaddress these issues with a distributed, efficient, and scalable index based onLocality-Sensitive Hashing (LSH). LSH is one of the most efficient and populartechniques for similarity search, but its poor referential locality propertieshas made its implementation a challenging problem. Our solution is based on awidely asynchronous dataflow parallelization with a number of optimizationsthat include a hierarchical parallelization to decouple indexing and datastorage, locality-aware data partition strategies to reduce message passing,and multi-probing to limit memory usage. The proposed parallelization attainedan efficiency of 90% in a distributed system with about 800 CPU cores. Inparticular, the original locality-aware data partition reduced the number ofmessages exchanged in 30%. Our parallel LSH was evaluated using the largestpublic dataset for similarity search (to the best of our knowledge) with $10^9$128-d SIFT descriptors extracted from Web images. This is two orders ofmagnitude larger than datasets that previous LSH parallelizations could handle.
展开▼